
Network churn Load test - Add network policy enforcement latency measurement #431

Closed
agrawaliti wants to merge 52 commits into main from network-churn

Conversation

agrawaliti commented Dec 12, 2024

Integrate network policy enforcement latency measurement.

Developed pipelines to compare network policy-related metrics between Azure CNI powered by Cilium and Azure CNI Overlay using Network Policy Manager.
All configuration, such as the number of nodes, pods, namespaces, and policies per namespace, can be updated in the pipeline.

Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41

Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad

agrawaliti marked this pull request as ready for review on December 30, 2024.
agrawaliti changed the title from "Network churn" to "Network churn Load test - Add network policy enforcement latency measurement" on Dec 30, 2024.
Author

@microsoft-github-policy-service agree company="Microsoft"

Comment thread jobs/competitive-test.yml
- name: run_id
type: string
default: ''
- name: run_id_2
Collaborator

What is run_id_2 for?

Author
agrawaliti commented Jan 9, 2025

I am using two different pre-created clusters for azure_cilium and azure_cni_overlay, and I am passing them via run_id and run_id_2. Creating two new clusters with 1000 nodes each for every run takes a very long time, so I pass the two cluster tags and run the tests against the existing clusters.

On second thought, I think I can do this with Terraform and schedule it to run periodically.
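
For reference, a minimal sketch of how the two cluster tags are surfaced as pipeline parameters (the parameter names appear in the diff; the defaults and the use_secondary_cluster wiring are assumptions):

```yaml
parameters:
  # Tag of the pre-created azure_cilium cluster
  - name: run_id
    type: string
    default: ''
  # Tag of the pre-created azure_cni_overlay cluster
  - name: run_id_2
    type: string
    default: ''
  # When true, test steps target the cluster identified by run_id_2
  - name: use_secondary_cluster
    type: boolean
    default: false
```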

Comment thread jobs/competitive-test.yml
- name: ssh_key_enabled
type: boolean
default: true
- name: use_secondary_cluster
Collaborator

What is the secondary cluster for?


variables:
SCENARIO_TYPE: perf-eval
SCENARIO_NAME: cilium-network-churn
Collaborator

Suggest naming this network-policy-churn.

parameters:
role: net
region: ${{ parameters.regions[0] }}
- template: /steps/engine/clusterloader2/cilium/scale-cluster.yml
Collaborator

Can we do this in Terraform when setting up the cluster?

Comment thread steps/setup-tests.yml
- name: run_id
type: string
default: ''
- name: run_id_2
Collaborator

same here

Comment thread steps/setup-tests.yml
type: string
- name: ssh_key_enabled
type: boolean
- name: use_secondary_cluster
Collaborator

same here

Contributor

jshr-w left a comment

Let's try to (1) minimize changes that touch code that other pipelines use, and (2) after minimizing those changes, run the other automated pipelines off this branch to ensure they aren't broken.

# Service test
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 4000}}
{{$SMALL_GROUP_SIZE := DefaultParam .SMALL_GROUP_SIZE 20}}
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}
Contributor

Can we name this CL2_SMALL_GROUP_SIZE to keep the variable naming consistent?
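
That is, a one-line sketch of the suggested rename (assuming the same CL2_-prefixed override convention as the surrounding parameters):

```yaml
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_SMALL_GROUP_SIZE 20}}
```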

{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}
{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
{{$smallDeploymentPods := DivideInt $totalPods $namespaces}}
Contributor

This is going to break all the other tests, right? Could you please restore this, and instead create a parameter for bigDeployments and set it to 0?
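
A sketch of the suggested shape, keeping the original computation and driving big deployments through the existing bigDeploymentsPerNamespace parameter (overriding it to 0 for this pipeline; the exact override mechanism is an assumption):

```yaml
# Keep the original pod math; the network-churn pipeline can pass
# bigDeploymentsPerNamespace=0 instead of editing the shared template.
{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
```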

deleteStaleNamespaces: true
deleteAutomanagedNamespaces: true
enableExistingNamespaces: false
enableExistingNamespaces: true
Contributor

This may break testing. If namespaces weren't deleted by the previous run, we should be aware of it (many existing pipelines depend on this). Let's restore the original value.

objectTemplatePath: deployment_template.yaml
templateFillMap:
Replicas: {{$bigDeploymentSize}}
Replicas: {{$bigDeploymentSize}}kube
Contributor

Is this a typo?

Group: {{.Group}}
deploymentLabel: {{.deploymentLabel}}
{{end}}
- namespaceRange:
Contributor

We don't want this code to execute if we are not running a network policy test, right? Shouldn't we 'else'-gate this?

Author

I don't want to run bigDeployment for the network test, so I have added {{if not $NETWORK_TEST}} around the big deployment.

Contributor

I mean, it works for your scenario, but it will break the others: if NETWORK_TEST is not set, what is stopping the pipeline from running both phases?
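
In other words, an explicit either/or gate over the two phases, sketched here with the phase bodies elided:

```yaml
{{if $NETWORK_TEST}}
  # network policy phases only
{{else}}
  # original big/small deployment phases only
{{end}}
```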

def calculate_config(cpu_per_node, node_count, provider, service_test):
def calculate_config(cpu_per_node, node_per_step, pods_per_node, provider):
throughput = 100
nodes_per_namespace = min(node_count, DEFAULT_NODES_PER_NAMESPACE)
Contributor

I think the changes made to pods_per_node are going to break many of the existing pipelines. After this change, how can the service test use 20 pods per node? We need to be careful adding parameters given the number of pipelines that depend on them... IMO the safest way is to have an IF branch here, and possibly a pods-per-node parameter ONLY for the network_test.

Author

In my opinion, having pods_per_node as a hard-coded constant that can change frequently based on the use case is not a good approach. I have added that parameter to the pipeline configuration, so if we need a custom value we can set it in the pipeline; if unset, it will keep working as before with the default of 40.

Contributor

The default isn't the same for all pipelines, though, so something would break.

For consistency, let's instead take the same approach as #456 and use the max_pods param to configure pods_per_node for this pipeline.
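
A sketch of what that could look like in the pipeline's engine input, reusing the existing max_pods input instead of introducing a new pods_per_node argument (the exact wiring in #456 may differ):

```yaml
# Drive pods-per-node through the existing max_pods input (as in #456)
# rather than adding a new pods_per_node argument to slo.py.
max_pods: ${{ parameters.pods_per_node }}
```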

"measurement": None,
"result": None,
# "test_details": details,
# # "test_details": details,
Contributor

nit: typo

parser_configure.add_argument("pods_per_node", type=int, help="Number of pods per node")
parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn")
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test")
parser_configure.add_argument("no_of_namespaces", type=int, default=1, help="Number of namespaces to create")
Contributor

Adding positional arguments without a forced default (you need to set nargs) will probably break all the other pipelines, in my understanding... this comment applies to all of the added arguments.

az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --node-taints "slo=true:NoSchedule" --labels slo=true
sleep 300
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --labels slo=true test-np=net-policy-client
# sleep 300
Contributor

This is going to affect all the other pipelines... please let's be careful!

Author

Hello, I pushed some test commits yesterday; I am cleaning them up. Thanks for pointing this out.

@@ -2,6 +2,7 @@ name: load-config

# Config options for test type
{{$SERVICE_TEST := DefaultParam .CL2_SERVICE_TEST true}}
Contributor

This value is not explicitly overridden in slo.py, so you are going to create services in this test even though you set it to false in cilium-network-churn.yml. This has been fixed in main; please rebase. We do not want services created in this test.

node_per_step: ${{ parameters.node_per_step }}
max_pods: 100
pods_per_node: ${{ parameters.pods_per_node }}
repeats: ${{ parameters.repeats }}
Contributor

We need to run this test with 10 repeats for data collection.

- eastus2
engine: clusterloader2
engine_input:
image: "ghcr.io/agrawaliti/clusterloader2:latest"
Contributor

I do not understand why we are using this image. We should be using the latest image from Azure/perf-tests. Has a change been made to the perf-tests fork? If so, it needs to be reviewed so we can get an updated image. Otherwise, we should use the latest clusterloader2 image like the other pipelines.

agrawaliti closed this on Apr 17, 2025.
agrawaliti deleted the network-churn branch on April 17, 2025.